We are now entering the phase of doing serious statistics: we run regression models, inspect them, plot, and tabulate them.

For this exercise, let’s again pretend to be researchers interested in the determinants of the subjective likelihood estimation of being treated for COVID-19 in a hospital. We suspect that age plays an important role, something we have already assessed by visualizing a similar relationship. So… we need the GESIS Panel COVID-19 data again:

library(dplyr)
library(haven)
library(sjlabelled)

gp_covid <-
  read_sav(
    "../data/ZA5667_v1-1-0.sav"
  ) %>% 
  set_na(na = c(-1:-99, 97)) %>% 
  mutate(
    likelihood_hospital = hzcy003a,
    age_cat = as.factor(age_cat)
  ) %>% 
  remove_all_labels()

We start our analysis by checking the relationship with an ANOVA. We also include some very essential covariates: gender, political orientation, and marital status.

1

Run an ANOVA on the proposed relationship between the subjective likelihood of treatment in a hospital and age. Also, include all the covariates.
We had a similar analysis before… you can simply add the control variables with +.
serious_anova <- 
  aov(
    likelihood_hospital ~ age_cat + sex + political_orientation + marstat,
    data = gp_covid
  )

summary(serious_anova)
##                         Df Sum Sq Mean Sq F value Pr(>F)    
## age_cat                  9    346   38.40  23.868 <2e-16 ***
## sex                      1      1    0.99   0.617  0.432    
## political_orientation    1      0    0.11   0.071  0.789    
## marstat                  1      1    1.06   0.657  0.418    
## Residuals             3092   4974    1.61                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 660 observations deleted due to missingness

Alright, there’s definitely something in the data we should investigate. Sometimes, it is considered bad practice, but we could try to make a median cut for the dependent variable and run logistic regression.

2

Make a median cut for the variable “likelihood_hospital” and run a logistic regression with all variables. What do you see in the results?
You can create new variables fairly easily with the function dplyr::mutate() in combination with the ifelse() function.
gp_covid <-
  gp_covid %>% 
  mutate(
    likelihood_hospital_cut =
      ifelse(
        likelihood_hospital > median(likelihood_hospital, na.rm = TRUE),
        1,
        0
      )
  )

serious_logistic_regression <- 
  glm(
    likelihood_hospital_cut ~ age_cat + sex + political_orientation + marstat,
    data = gp_covid,
    family = binomial(link = "logit")
  )

summary(serious_logistic_regression)
## 
## Call:
## glm(formula = likelihood_hospital_cut ~ age_cat + sex + political_orientation + 
##     marstat, family = binomial(link = "logit"), data = gp_covid)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.259  -1.008  -0.756   1.242   2.070  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -2.1525329  0.3987424  -5.398 6.73e-08 ***
## age_cat2               0.5570465  0.4001281   1.392  0.16387    
## age_cat3               0.2237200  0.4070216   0.550  0.58256    
## age_cat4               0.8552502  0.3851614   2.220  0.02638 *  
## age_cat5               1.0188140  0.3837725   2.655  0.00794 ** 
## age_cat6               1.2015271  0.3783196   3.176  0.00149 ** 
## age_cat7               1.5248309  0.3643355   4.185 2.85e-05 ***
## age_cat8               1.7367353  0.3739040   4.645 3.40e-06 ***
## age_cat9               1.9148385  0.3750669   5.105 3.30e-07 ***
## age_cat10              1.9574845  0.3732600   5.244 1.57e-07 ***
## sex                    0.0687836  0.0777806   0.884  0.37652    
## political_orientation  0.0247512  0.0206458   1.199  0.23059    
## marstat               -0.0007526  0.0481534  -0.016  0.98753    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4067.1  on 3104  degrees of freedom
## Residual deviance: 3888.4  on 3092  degrees of freedom
##   (660 observations deleted due to missingness)
## AIC: 3914.4
## 
## Number of Fisher Scoring iterations: 4
# Interpretation:
# With increasing age, the probability of thinking it's likely to be 
# hospitalized because of COVID-19 increases. But really?

Running one model is not enough. It may be that the assumptions for logistic regression are not fulfilled, or a reviewer simply doesn’t like these types of regressions. Instead, she proposes to run a binomial regression but with a cauchit link.

3

Run the same model with the sole difference of using a cauchit link. How would you interpret the different regression coefficients?
Have a look at the help page ?family to see how you can include a cauchit link.
serious_cauchit_regression <- 
  glm(
    likelihood_hospital_cut ~ age_cat + sex + political_orientation + marstat,
    data = gp_covid,
    family = binomial(link = "cauchit")
  )

summary(serious_cauchit_regression)
## 
## Call:
## glm(formula = likelihood_hospital_cut ~ age_cat + sex + political_orientation + 
##     marstat, family = binomial(link = "cauchit"), data = gp_covid)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2523  -1.0064  -0.7601   1.2431   2.0485  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)   
## (Intercept)           -2.548363   0.849671  -2.999  0.00271 **
## age_cat2               1.036283   0.875953   1.183  0.23680   
## age_cat3               0.502607   0.912892   0.551  0.58193   
## age_cat4               1.427898   0.852017   1.676  0.09376 . 
## age_cat5               1.605221   0.848333   1.892  0.05846 . 
## age_cat6               1.792792   0.843912   2.124  0.03364 * 
## age_cat7               2.072772   0.837850   2.474  0.01336 * 
## age_cat8               2.246052   0.840343   2.673  0.00752 **
## age_cat9               2.386031   0.840525   2.839  0.00453 **
## age_cat10              2.419927   0.840032   2.881  0.00397 **
## sex                    0.046430   0.069014   0.673  0.50109   
## political_orientation  0.017770   0.018113   0.981  0.32655   
## marstat               -0.004667   0.040372  -0.116  0.90796   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4067.1  on 3104  degrees of freedom
## Residual deviance: 3889.1  on 3092  degrees of freedom
##   (660 observations deleted due to missingness)
## AIC: 3915.1
## 
## Number of Fisher Scoring iterations: 5
# Interpretation of difference:
# I don't know...

Ok, using different link functions is sometimes done as they provide different model fits. That’s definitely something we should investigate as well.

4

Compare both regression models with an ANOVA. Use the option test = "LRT" to perform a likelihood ratio test. What’s your interpretation?
A p-value considered as statistically significant would indicate a difference between the models.
anova(
  serious_logistic_regression, 
  serious_cauchit_regression,
  test = "LRT"
  )
## Analysis of Deviance Table
## 
## Model 1: likelihood_hospital_cut ~ age_cat + sex + political_orientation + 
##     marstat
## Model 2: likelihood_hospital_cut ~ age_cat + sex + political_orientation + 
##     marstat
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1      3092     3888.4                     
## 2      3092     3889.1  0 -0.72287
# Interpretation of difference:
# It seems there is no big difference. We can make the reviewer happy with the
# cauchit models and still keep our results.